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We describe an active-set method for the minimization of an objective function (j) that is the 
sum of a smooth convex function and an ^i-regularization term. A distinctive feature of the 
method is the way in which active-set identification and second-order subspace minimization 
steps are integrated to combine the predictive power of the two approaches. At every itera¬ 
tion, the algorithm selects a candidate set of free and fixed variables, performs an (inexact) 
subspace phase, and then assesses the quality of the new active set. If it is not judged to be 
acceptable, then the set of free variables is restricted and a new active-set prediction is made. 

We establish global convergence for our approach, and compare the new method against the 
state-of-the-art code LIBLINEAR. 
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1. Introduction 

The problem of minimizing a composite objective that is the sum of a smooth convex 
function and a regularization term has received much attention; see e.g. [3, Hi and the 
references therein. This problem arises in statistics, signal processing, machine learning 
and in many other areas of applications. In this paper we focus on the case when the 
regularizer is defined in terms of an £i-norm, and propose an algorithm that employs a 
recursive active-set selection mechanism designed to make a good prediction of the active 
subspace at each iteration. This mechanism combines first- and second-order information, 
and is designed with the large-scale setting in mind. 

The problem under consideration is given by 


min (^(x) = f{x) + b\\x\ 
xSM" 


( 1 ) 


We assume that / is a smooth convex function and /x > 0 is a fixed penalty parameter. 

The algorithm proposed in this paper is different in nature from the most popular 
methods proposed for solving problem ([1]). These include first-order methods, such as 
ISTA, SpaRSA and FISTA [1, i 0, and proximal Newton methods that compute a 
step by minimizing a piecewise quadratic model of ([T]) using (for example) a coordinate 


descent iteration 


21 


The proposed algorithm also differs 


from methods that solve ([T|) by reformulating it as a bound constrained problem; for e.g. 
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[13, El, m, [13, Q. 

Our algorithm belongs, instead, to the class of orthant-based methods [l|] that minimize 
a smooth quadratic model of 0 on a sequence of orthant faces of M"" until the optimal 
solution is found. But unlike the orthant-based methods described in UM and the 
bound-constrained approaches in [13], every iteration of our algorithm consists of a 
corrective cycle of orthant-face predictions and snbspace minimization steps. This cycle is 
terminated when the orthant-face prediction is deemed to be reliable. After a trial iterate 
has been computed, a globalization mechanism accepts or modifies it (if necessary) to 
ensure overall convergence of the iteration. 

The idea of employing a correction mechanism for refining the selection of the orthant 
face was introduced in [3| for the case when / is a convex quadratic function. That 
algorithm is, however, not competitive with state-of-the-art methods in terms of CPU 
time because each iteration requires the exact solution of a subspace problem, which 
is expensive, and because the orthant-face prediction mechanism is too liberal and can 
lead to long corrective cycles. These deficiencies are overcome in our algorithm, which 
introduces two key components. We employ an adaptive filtering mechanism that in 
conjunction with the corrective cycle yields an efficient prediction of zero variables at 
each iteration. We also design a strategy for solving, inexactly, the subproblems arising 
during each corrective step in a way that does not degrade the accuracy of the orthant- 
face prediction and yields important savings in computation. We show that the algorithm 
is globally convergent for strongly convex problems. Numerical tests on a variety of 
machine learning data sets suggest that our algorithm is competitive with a leading 
state-of-the-art code. 

The main features of our algorithm can also be highlighted by contrasting them with 
recently proposed proximal Newton methods for solving problem ([T|). The algorithms 
proposed by [13, [ 21 I . [sH and others first chose an active set of variables using first-order 
sensitivity information. The active variables are set to zero, and the rest of the variables 
are updated by minimizing a piecewise quadratic approximation to ([I]) given by 


q'^ix) = f{x^) + {x- x^fVf{x^) + \{x- x’^fV^f{x’^){x - x’^) + /i||x||i. (2) 

This minimization is performed inexactly using a randomized coordinate descent method. 
After a trial iterate is computed in this manner, a backtracking line search is performed 
to ensure decrease in 4>{x). 

The proximal Newton methods just outlined employ a very simple mechanism (the 
minimum norm subgradient) to determine the set of active variables at each iteration. 
On the other hand, they solve the sophisticated lasso subproblem ([2]) that inherits the 
non-smooth strnctnre of the original problem and permits iterates to cross points of 
non-differentiability of (f){x). The latter property allows proximal Newton methods to 
refine the active set with respect to its initial choice. In contrast, our method invests a 
significant amount of computation in the identification of a working orthant face in M”, 
and then minimizes a simple smooth quadratic approximation of the problem on that 
orthant face, 

q’^ix) = f{x^) + {x- x'^fVfix^) + ^ix- x^fV^f{x^){x - x^) + (3) 

where C is an indicator with values 0, 1 or -1, that identifies the orthant face. The working 
orthant is selected carefully, by verifying that the predictions made at each corrective 
step are realized. We do so because a simpler selection of the orthant face, snch as that 
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performed in the OWL method [l|], or the method described in can generate poor 
steps in some circumstances. 

Given that the two approaches (proximal Newton with coordinate descent solver and 
our proposed method) are different in nature, it is natural to ask if one of them will 
emerge as the preferred second-order technique for the solution of problem ([T]). To an¬ 
swer this question we compared a MATLAB implementation of our approach on binary 
classification problems with the well-known solver LIBLINEAR (written in C), based 
on CPU time. One of the main conclusions of this paper is that both approaches have 
their strengths. Orthant-based methods have the attractive property that the subspace 
minimization can be performed by a direct linear solver or by an iterative method such 
as the conjugate gradient method, which is efficient on a wide range of applications. 
On the other hand, the proximal Newton approach method is very effective on appli¬ 
cations where the Hessian matrix is diagonally dominant (or nearly so). In this case, 
the coordinate descent iteration is particularly efficient in computing an approximate 
solution of problem ([2]) . Both approaches share the need for effective criteria for deciding 
when an approximate solution of the subproblem is acceptable. Most implementations 
of the proximal Newton method employ adaptive techniques (heuristics or rules based in 
randomized analysis), while our implementation employs the classic termination criteria 
based on the relative error in the residue of the linear system 0 - 

This paper is organized in 5 sections. In Section [2] we outline the algorithm, paying 
particular attention to the orthant-face identification mechanism. Section [3] discusses the 
procedure by which we safeguard against poor steps and ensure global convergence of 
the algorithm. In Section [H we present a comparison of our algorithm against the state- 
of-the-art code LIBLINEAR for the solution of binary classification problems; some final 
remarks are made in Section [5l 


2. The Proposed Algorithm 


The algorithm exploits the fact that the objective function cf) is smooth in any orthant 
face of M”, which is defined as the intersection of an orthant in R” and a subspace 
{x : Xj = 0, i € I C {1,... re}}. 

At every iteration, the algorithm identifies an orthant face in R"' using sensitivity 
information, performs a minimization on that orthant face to produce a trial point, refines 
the orthant-face selection (if necessary), and repeats the process until the choice of the 
orthant face is judged to be acceptable. Upon termination of this cycle, a backtracking 
line search is performed where the trial points are projected onto the active orthant. 

To describe the algorithm in detail, we introduce some notation. Let g{x) denote the 
minimum norm subgradient of the objective function ([T]) at a point x. Thus, we have 


9i{x) 


Vif{x) -|- /i if Xj > 0 or {xi = 0 and Vj/(x) + p < 0) 
< Vif{x) — /i if Xi < 0 or (xj = 0 and Vi/(x) — p > 0) 

0 otherwise, 

\ ' 


for i = 1,... re, where 


V./(x) 


dfjx) 

dxi 


( 4 ) 


3 
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At an iterate , we define three sets: 

= {i\ xf = 0 and |Vj/(x^)| < (5) 

7-^ = {i|xfyo} (6) 

= {i\x^ = 0 and |Vj/(x^)| > /z}. (7) 

The variables in are kept at zero (since the corresponding components of gi{x^) are 
zero), while those in are free to move. The remaining variables are in the set lA^. The 
decision of which of these are allowed to move significantly impacts the efficiency of the 
algorithm. Using the selection mechanism described below, we first create a partition of 

= Ua^Uf, ( 8 ) 

where the variables in IAa are fixed at zero and the variables in Up are allowed to move. 
We then update the active set as 


A^ ^A^VJ Ua, (9) 

and compute a trial step as the (approximate) solution of the smooth quadratic 
problem 


min V’(d) = + -d^H^d 

s.t. dj = 0, i € A’^, (10) 

where = V^/(x^). The trial iterate is defined as 


We then start the corrective cycle and check whether all variables in the set Up moved 
as predicted; i.e., whether 

sgn([x^]i) = sgn(-[ 5 (x^)]i) for all i € Up. (11) 

Any variable j G Up for which this equality does not hold, is removed from the set Up 
and added to Ua- The set A^ is then updated according to ([9]) and a new trial step 
is recomputed by solving (1101) . We repeat this corrective cycle until all predictions are 
correct and the trial point x^ satisfies dUD. 

The algorithm then performs a projected backtracking line search along d^ to ensure 
that the resulting point yields a decrease in the piecewise quadratic model q^(x) defined 
in ([2]). (We do not perform the line search on the objective function ([1]) as that is 
more expensive, and the globalization mechanism described in Section [3] only requires a 
decrease in q^{x).) 

At iteration k, we identify the current orthant face based on sensitivity information 
dH) and define the vector by 


sgn([x^]i) if [x% A 0 
sgn(-[5r(x'')]i) if = 0. 


( 12 ) 
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Let V^{x) be the projection operator that projects x G onto the orthant defined by 
C^; i-e., 


= if ( 13 ) 

1^0 otherwise. 

We then search for the largest step size a G {2°, 2~^, 2~^, ■ ■ ■ } such that 

g(x^)>q(P^(x'‘ + a-d^)), 

where q is the non-smooth quadratic approximation given by ([2]). Such a step size exists 
because is a descent direction for the smooth quadratic function q^ and because 
the trial point lies within the orthant defined by for sufficiently small steps (see 
Theorem lA.41 in the appendix). 

Before giving a detailed description of the algorithm, we describe the selection mech¬ 
anism that, at the beginning of each corrective cycle, defines the splitting ([8|) of the set 
into variables Ua, that are kept at zero, and variables Up, that are allowed to move. 

At the start of the algorithm, we select a scalar r] G (0,1) and set \Uf\ = [rj xn\; 

i.e., the cardinality of the set Up is a fraction of the dimension of the problem. On 
subsequent iterations, we update the parameter based on its previous value and 
the number of iterations in the previous corrective cycle. If there were no corrections in 
the previous corrective cycle, we set = 2r^ to allow more variables to change at the 
next outer iteration; otherwise we keep the value of unchanged. Since the number of 
variables in Up cannot be larger than \U^\, the actual size of Up is given by 

\Up\ = = min{\U^\,T’^). 

We use a greedy strategy to populate the sets Ua and Up: we collect in Up the variables 
in U^ with the largest components of the subgradient \g{x^)\. Thus, for any i G Up and 
j G Ua, we have \gi{x^)\ > \gj{x^)V 

A formal description of the overall method is given in Algorithm [TJ 
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Algorithm 1 Preliminary Adaptive Orthant-Based Method 
1 : Given € M”, L > 0, fi > 0, rj & {0, 1). 

Let = [rj X n\. 

2 : while k = 0,1,2, ■■ ■ and stopping criterion not met do 
3: Active-Set Identification: 

= {i\ixi)^ = 0 and |Vi/(x^)| < fi} 

= {i|(x.)V 0} 

= {i\ixi)’^ = 0 and |Vj/(3:^)| > fi} 

4: Selection Mechanism: 

Compute g{x^) by (jH and by (fT^ . 

5: Set min(|W*^|,T^). 

6: Choose Uf,Ua C such that Up V\Ua = 0, \kiF\ = and for any i G Up and 

j G Ua, \9i{x'")\ > \gj{x^)V 

7: Set ^ UUa- 

8: Compute or update second-order approximation 

9: Corrective Cycle: 

Set Up and j ■<— 0. 

10 : while 7 ^ 0 do 

11 : 

= argmin d^g{x'^) + \d^H^d 
di=0,ieA’‘ 

12 : Set X^ X^ + dfi. 

13: Set = {i ^ Up \ A^|sgn(Cf) A sgn(xf)}. 

14: Set U and j ^ j + 1. 

15: end while 

16 : if j = 1 then 

17: Set = 2-tA 

18 : end if 

19: Projected Line Search: 

Set a ■<— 1. 

20 : while q{x^) > q{V^{x^ + a ■ dfi)) do 

21 : Set a a/2. 

22 : end while 

23: Set = V^{x^ + a ■ dfi). 

24: end while 


In this paper we assume that the quadratic model (jlOj) employs exact Hessian infor¬ 
mation, i.e. = V^/(x^), and that we perform an approximate minimization of this 
problem using the conjugate gradient method in the appropriate subspace of dimension 
(n — 1^*1); see e.g., [13] • The matrix can also be defined by quasi-Newton updates, 
specifically using the compact representations of limited-memory BFGS matrices j^. 
Although we do not explore a quasi-Newton variant in this paper, we expect it to be 
effective in many applications. 

Our selection mechanism for defining the splitting ([8]) is motivated by the following 
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considerations. If all variables in were allowed to move, the algorithm would have 
similar properties to the OWL method [ll], whose performance is not uniformly successful 
(see Section SI). Indeed, we observed a more reliable performance when the size of Up 
is limited. This also has computational benefits because the subproblem m is less 
expensive to solve when the number of free variables is smaller (i.e., when the set 
is larger). On the other hand, in the extreme case \Uf\ = 1, the algorithm resembles a 
classical active-set method, which is not well-suited for large-scale problems. 

These trade-offs are addressed by the dynamic strategy employed in steps 5 and 17 of 
Algorithm [U Initially, we choose Up to he a. small subset ofU^ (by selecting rj to be small). 
The algorithm increases the size of Up in subsequent iterations if there is evidence that 
the current choice is too restrictive. As as indicator we observe the number of iterations in 
the previous corrective cycle. A small number of corrections (in our implementation this 
number is 1) suggests that the choice of Up may be too conservative and the size of Up is 
doubled at the next outer iteration. We have found that this selection mechanism leads 
to more gradual and controlled changes in the active set compared to other orthant-based 
methods like OWL and the method proposed in Section 5 of [^. 

The projected backtracking line search in Algorithm [1] differs from that used by other 
orthant-based methods in that it is based on the quadratic model and not the objective 
function. As in other orthant-based methods, the projection promotes sparsity in the 
iterates and provides some control for steps that leave the current orthant, outside of 
which the smooth approximation in (|10l) is not valid. But in contrast to other orthant- 
based methods, such as OWL, the line search is not the main globalization mechanism 
in our algorithm, as described next. 


3. Globalization Strategy 

While Algorithm [U generally works well in practice, it may fail (cycle) when the changes 
in the active set are not sufficiently controlled. By adding a globalization mechanism we 
ensure that all iterates generated by the algorithm provide sufficient reduction in the 
objective function and converge to the solution. Our mechanism employs the iterative 
soft-thresholding algorithm (ISTA) [1,[^ to generate a reference point. Because the ISTA 
method enjoys a global linear rate of convergence on strongly convex problems, it provides 
a benchmark for the progress of our algorithm. 

We modify Algorithm [T] as follows. The iterate computed in line 23 is now regarded as 
a trial iterate and denoted by x^. To decide if this point is acceptable we check whether it 
produces a lower function value than the ISTA step computed from the starting point of 
the iteration, x^. If so, we accept the trial point; otherwise, we search along the segment 
joining x^ and the ISTA point to find an acceptable point. Given a Lipschitz 

constant L for the gradient of /, the cost of computing the ISTA step is negligible since 
gradient information is already available at x^. However, the evaluation of (;i>(x(^g.j.^) incurs 
an additional cost. To get around this expense, we use the value of an upper quadratic 
approximation of (j) at xfg.p^ as a surrogate to 0(xfg.p^). More specifically, assuming that 
L is a Lipschitz constant of V/, we define the value of the surrogate function as 

r‘ = /(I*) + v/(i‘)^(4„ - x^) + r||4„ - x^wl + mII4™IIi- (u) 

The computation of T^ requires only one inner product. The complete version of the 
algorithm, including the globalization mechanism, is given in Algorithm [2j 
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Algorithm 2 Orthant-Based Adaptive Method (OBA) 

1 : Given € 3?”, L > 0, /i > 0, 77 € (0,1) and e > 0. 

Let = [?7 X n\ 

2 : while A; = 0, 1,2, • • • and stopping criterion not met do 
3: Carry out steps 1 - 22 of Algorithm [TJ 

4: Set = V^{x^ + a ■ d^). 

5; Globalization: 

Compute ISTA step at x^ as 

4TA = 5^/L(x"-iV/(x")) 

where Sa(x) is a component-wise operator defined as Sa(x}i = maxjlxil — a,0} • 
sgn(a;i). 

Set ^ and d ^ 1. 

Calculate using IfTO . 

while 4'(xjgrj,j^ + a ■ d^) > do 
Set a <— 6il2. 

if a < e then 
Set d ■<— 0. 
end if 
end while 

Set + a ■ df^. 

end while 
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The following convergence result is proven in the appendix. 

Theorem Assume that f is continuously differentiable and strongly convex and that 
V/ is Lipschitz continuous. Then, the iterates {x^} generated by Algorithmic converge 
to the optimal solution x* of problem © at a linear rate. 


4. Numerical Experiments 


In this section, we demonstrate the viability of our approach. While our method applies 
to any convex function with an additive .^i-regularizer, we focus on the specific problem of 
binary classification using logistic regression. This problem is well studied with theoretical 
guarantees and many data sets available of varying sizes, structures and fields of study. 
Further, the results reported on this problem are representative of the performance of 
OBA on other functions (including multi-class logistic regression, probit regression and 
LASSO) where similar trends are observed. We direct the reader to [10|] and the references 
therein for details regarding the function / and the statistical justifications of this choice. 

The data sets chosen for comparison are listed in Table [H Synthetic is a randomly gen¬ 
erated, balanced, non-diagonally dominant problem; the process for generating this prob¬ 
lem is described in the appendix. Alpha is a data set from the Pascal Large Scale Learn¬ 
ing Challenge [l^. Both these data sets have been feature-wise normalized to [—1,1]. 
Details for the other data sets along with their preprocessing steps can be found in 
http://www.csie.ntu.edu.tw/~cjlin/liblinear and the references therein. 

A variety of methods has been proposed for solving problem ([T|), and high-performance 
implementations of some of these methods are available. One of the most popular codes 
is newGLMNET which is a C-implementation of a proximal Newton method and is 
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Table 1. Data sets 


Data set 

number of data points 

number of features 

Gisette 

6000 

5000 

RCVl 

20242 

47236 

Alpha 

500000 

500 

KDDA 

8407752 

20216830 

KDDB 

19264097 

29890095 

Epsilon 

400000 

2000 

News20 

19996 

1355191 

Synthetic 

5000 

5000 


a part of the LIBLINEAR package. Every iteration of this method identifies the active 
set as a subset of as defined in (l5|), and then solves problem Q inexactly using a 
randomized coordinate descent algorithm. The termination criterion for this inner loop 
is based on the .^i-norm of the minimum norm subgradient and adjusted by a heuristic 
as the iteration progresses. 

We implemented Algorithm [2] in MATLAB, where we chose r/ = 0.01 and e = 10“^. 
Subproblem (IlOp is solved inexactly via the conjugate gradient algorithm. The termina¬ 
tion criterion is based on the relative tolerance of the linear system: The inner loop stops 
as soon as the conjugate gradient iterate p satisfies 


Ib'^lloo 


< 0.1. 


We also compare with the OWL method [ij], as implemented by Schmidt 22]. The OBA 
algorithm with the selective-corrective mechanism removed is somewhat related to OWL. 
The primary differences between the two include the procedure of handling the active 
set constraints in the subproblem and the alignment step included in OWL. Specifically, 
OWL minimizes the quadratic model in (|10[) over M”, aligns the search direction and 
then carries out a projected line search onto the active set. 

Eor all test problems, the regularization parameter p was chosen through a 5-fold cross 
validation. LIBLINEAR and OBA use the exact Hessian in defining the quadratic model 
()10p and ([2]) while OWL uses a limited-memory BEGS approximation. Eurther, in order 
to solve singular problems, LIBLINEAR adds a small multiple (specifically, 10“^^) of the 
identity to the Hessian and our algorithm uses the value of 10“®. LIBLINEAR employs 
a secondary mechanism to guard against singularity: it projects the result of the one 
dimensional optimization in the coordinate descent step onto the set [—10,10]. 


4.1 Test Results 

The comparison of the method proposed in this paper, Algorithm 2 (OBA), against 
LIBLINEAR and OWL is presented in Eigures [T] and O We plot the relative function 
error defined as 


(/)(a:^) — 4>{x*) 
1 -|- ^(x*) 


(15) 


against CPU time. The value of 4>{x*) was obtained by running our algorithm to a tight 
tolerance of 10“^*^ or until a time limit of 5000 CPU seconds was exceeded. The tolerance 
used corresponds to the one defined in [^. The initial iterate for all methods was the 
zero vector. 
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Figure 1. Relative error 03 in the objective function (vertical axis) versus CPU time - Part 1 


We can see that for RCVl and News20, the performance of OBA is inferior to LIBLIN- 
EAR; however, for KDDA, KDDB, Epsilon and Synthetic, the performance is superior. 
For Gisette and Alpha, the performance is roughly comparable irrespective of the value 
of the relative function error. OWL has gained a reputation as an algorithm which per¬ 
forms well but unreliably so. The experiments support this opinion. For problems like 
KDDA or KDDB, the performance of OWL is superior to both LIBLINEAR and OBA; 
however, for other problems like Synthetic, Alpha, Epsilon and Gisette, OWL fails to be 
competitive due to poor steps and rapid changes in working orthant faces. 

We emphasize that the improved performance of the proposed method is driven by 
the selective-corrective mechanism as opposed to the ISTA backup. The backup was 
never required in the reported experiments for our method which, in contrast to other 
orthant-based methods, enjoys global convergence properties. 


4.2 Sparsity 

It is natural to ask whether an orthant-based method such as OBA is as effective at 
generating sparsity in the solution as a proximal Newton method, such as LIBLINEAR. 
In proximal Newton methods the non-smoothness of the original problem is retained 
in the subproblem ([2|), and sparsity arises because the solution of the subproblem typ- 
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Figure 2. Relative error 03 in the objective function (vertical axis) versus CPU time - Part 2 


ically lies at points of non-differentiability. In contrast, orthant-based methods solve a 
series of smooth problems that have no tendency of inducing sparsity in the solution 
by themselves, but achieve it through the projection of the trial point onto the working 
orthant. 

Both methods, LIBLINEAR and OBA, also promote sparsity through the definition of 
the active set at the beginning of each (outer) iteration, but the construction of the active 
set differs in the two methods. LIBLINEAR fixes only a subset of the variables in the set 
to zero; thus allowing some variables in and all variables in the set lA^ to move. On 
the other hand, OBA fixes all of the variables in to zero and additionally fixes more 
variables in lA^ through the selection mechanism and the corrective cycle. Therefore, the 
approach in LIBLINEAR can be considered more liberal in that it releases more zero 
variables, while the approach in OBA can be regarded as more restrictive. Nevertheless, 
OBA becomes increasingly more liberal as the iteration progresses because the selection 
mechanism allows the size of the set Up to double under certain circumstances (see step 
17 of Algorithm [1]) . 

In the light of these algorithmic differences, it is difficult to predict the relative ability 
of the two methods at generating sparse solutions. To explore this, we performed the 
following experiments using our data sets and recorded the sparsity in the solution. 
LIBLINEAR was used to solve the problems with the tolerance of their stopping criterion 
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Table 2. Percentage of Zeros in Solu¬ 
tion 


Data set 

LIBLINEAR 

OBA 

Gisette 

88.92 

90.52 

RCVl 

97.62 

97.65 

Alpha 

5.60 

5.20 

KDDA 

98.43 

98.71 

KDDB 

97.11 

97.87 

Epsilon 

44.60 

69.20 

News20 

99.60 

99.37 

Synthetic 

56.86 

58.82 


set to 10“^ (which yielded better misclassification rates than the default value of 10“^), 
and OB A was then used to solve the problems to a similar accuracy in the objective 
function. The results are presented in Table [2] and show that the two methods achieve 
similar values of sparsity, with OBA being somewhat more effective. 


4.3 Conjugate Gradients versus Coordinate Deseent 

Let us now focus on the methods used for solving the subproblems that incorporate 
second-order information about the objective function. It is natural to employ the con¬ 
jugate gradient (CG) method in OBA, given that the subproblem ([3|) is smooth and that 
the CG method is an optimal Krylov process that can exploit problem structure effec¬ 
tively. An alternative to the CG method is the randomized coordinate descent algorithm, 
which has gained much popularity in recent years [13, Ii3,|2^ 

A drawback of coordinate descent for smooth unconstrained optimization is that it can 
be slow when the Hessian is not diagonally dominant. We experimented with a coordinate 
descent solver for the subproblem in OBA and found that its overall performance is 
inferior to that of the GG method. 

The situation is quite different in a proximal Newton method where the subproblem 
is non-smooth. In that case, it is easy to compute the exact minimizer of m along 
each coordinate direction, thereby dealing explicitly with the non-differentiability of the 
original problem. Since this one dimensional minimization may return zero as the exact 
solution, the proximal coordinate descent method provides an active-set identification 
mechanism for the overall algorithm. Thus, although sensitivity to the lack of diagonal 
dominance may still be present, it is of a lesser concern due the benefits of its active set 
identification properties. Furthermore, applications in text classification and other areas 
often lead to problems with Hessians that are diagonally dominant liil. 

This discussion motivates us to look more closely at the issue of diagonal dominance 
and its effect on the two methods. In order to quantify the level of diagonal dominance, 
we use the metric employed, for example, in [29|]. Given any symmetric matrix A, we 
define the level of diagonal dominance of A as 

maxj \\Ai\\2 
maxj \Aii\ 




where A* denotes the column of A and An denotes the diagonal element of A. 
The smaller the value of T), the closer is A to being diagonally dominant. In Table [3] we 
report the values of V (V^/(x*^)) for all data sets in Tabled] 

Let us begin by considering problem Synthetic, which was specifically constructed 
to have a high value of T> (see Appendix |B] for details). We observe from Figure [2| 
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Table 3. The value of f{x^)) as 

defined in Gil 


Data set 


Gisette 

57.99 

RCVl 

1.88 

Alpha 

9.93 

KDDA 

1.80 

KDDB 

1.50 

Epsilon 

5.55 

News20 

3.29 

Synthetic 

69.42 


that LIBLINEAR performs poorly compared to OBA. This may be an indication that 
proximal Newton methods are sensitive to a lack of diagonal dominance. In fact, by 
altering this problem so that "D increases, the performance of LIBLINEAR deteriorates. 
The text classification tasks (RCVl and News20), which are empirically observed to 


be diaTOnally dominant |lll |. have low values of "D and indeed, LIBLINEAR converges 
quickljfJ. However, LIBLINEAR also performs well on problem Gisette for which T) 


is large and poorly on KDDB for which D is low. An examination of the rest of the 
results prevents us from establishing a clear correlation between the value of "D and the 
relative performance of the two methods. We conclude that in £i-regularized problems 
the adverse effects of diagonal dominance appear to be less pronounced than for smooth 
optimization. Other factors such as the frequency of orthant changes and the inexactness 
in the subproblem solution may also play a crucial role in explaining the performance 
differences. The identification of problem characteristics that determine which method 
performs better for a given instances requires further investigation. 


5. Final Remarks 

In this paper, we presented a second-order algorithm for solving convex .^i-regularized 
problems. At each iteration, the algorithm tries to predict the orthant face containing the 
solution, solves a smooth quadratic subproblem on this orthant face, and then invokes a 
corrective cycle that greatly improves the efficiency and robustness of the algorithm. We 
globalized the method by using the ISTA step as a reference for the desired progress. This 
enabled us to prove a linear convergence rate of the iterates for strongly convex problems. 
The ISTA backup is rarely used in practice (and never in the reported experiments) and 
thus, our theoretical result applies to a very robust method that invokes the safeguarding 
very rarely. This globalization procedure is analogous to a Newton trust-region method 
where the underlying method is known to be very effective but convergence can only be 
proved by overcoming pathological situations with a first-order Cauchy step. Numerical 
experiments for logistic regression data sets show that our algorithm is competitive in 
terms of CPU time with the LIBLINEAR C-code, even though our implementation is in 
MATLAB. The algorithm is also effective in generating sparse solutions quickly. Overall, 
our experiments indicate that orthant-based methods are a viable alternative to proximal 


^Interestingly, the LIBLINEAR website and manual convey that LIBLINEAR is known to perform well on docu- 
ment classification tasks but not necessarily on others. 
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Newton methods. 
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Appendix A. Convergence Analysis 

Recall that we wish to solve the problem 

min (j){x) = f{x)-\-ft\\x\\i. 


For the purpose of our analysis, we make two assumptions: 

Assumption A.l The function f is in and strongly convex with parameter A > 0; 
i.e., for any x,y C M"' and t G [0, !].■ 

f{tx + (1 - t)y) < tf{x) + (1 - t)f{y) - iAt(l - t)||x - y\\l. (Al) 

As shown in Nesterov (2004), for continuously differentiable functions, this assumption 
is equivalent to 

f{y)> f{x) + Vf{xf'{y-x) + ^\\y-x\\l for all G M". (A2) 

Assumption A.2 The gradient of f is Lipschitz continuous with constant L > 0; i.e., 
for any x,y G M”, 


l|V/(a:)-V/(y)||2<A||a:-y||2 

The first theorem shows that the algorithm is well-defined. 

Theorem A.3 The backtracking projected line search (steps 20-22 of Algorithm [I]) 
terminates in a finite number of iterations. 

Proof. Consider the k^^ iteration of Algorithm [TJ For notational simplicity, we drop the 
iteration index and denote the iterate as x, the direction obtained after the corrective 
loop (steps 10-15) as d, and the smooth and non-smooth quadratic approximations as 
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q and q, respectively. Along the same lines, let A and U be the active and unsure sets 
during this iteration. 

We first show that there exists an d > 0 such that for any a > 0 with a < d, we 
have V{x + ad) = x -\- ad. Let Ii = {i € {1,2,3, ••• ,n} : Xi A 0}} X 2 = {f € 
(1, 2, 3, • • • ,n} : Xi = 0}}, and let d > 0 such that d < | y | for all i G Zi with di A 0- 
Let a > 0 be such that a < a and C be defined in (fT^ . We consider two cases: 

• Case 1: f € Xi 

Because a < a < |y|, it is clear that sgn(xi + adi) = sgn(xj) = Cij and therefore 
V{x + ad) = X + ad. 

• Case 2: i G X 2 

By definition, Xj = 0. If i G A in step 19 of Algorithm 1, dj = 0 and sgn(xi + adi) = 
sgn(xj) = Ci- Otherwise, i G Up ^ U, and therefore sgn (—= Q ^ (“I; !}■ Assume 
that Q = 1. Thus, sgn(—= 1, and since = 0 and i G Up, sgn(x^ + d^) = 
sgn{d^) = 1, so dj > 0 which in turn implies adi > 0. The same conclusion can be 
made if Q = —1. Thus, V{xi + adi) = 'T(adi) = adi = Xi + adi. 

Because d is a minimizer of q{x + d) in some subspace, we have q{x + ad) < q{x) for 
sufficiently small a < a. Then, 'P{x + ad) = x-\-ad, i.e., x + ad is in the same orthant as 
X, and therefore q{x + ad) = q{x + ad) < q{x) = q{x). As a consequence, the termination 
condition in the while-loop is satisfied after a finite number of iterations. 


We now show that by ensuring that (j) at the new iterate is no larger than the majorizing 
function T^, we can establish linear convergence. 

Theorem A.4 Suppose that Assumptions IA.il and I A. HI hold. Then, the iterates {x^} 
generated by Algorithmic converge to the optimal solution x* of problem 0 at a linear 
rate. 

Proof. Consider the iteration of Algorithm [2l For notational simplicity, let us drop 
the iteration index and denote the minimum norm subgradient as g, the Hessian ap¬ 
proximation as H, and the iterate as x. Further, as is well known, the ISTA point xj^g.^.^, 
computed in step 5 of Algorithm 2, is the minimizer of a proximal approximation of (f){x), 

xfsTA = argmin/(x) + {y - x)^V/(x) + ^\\y - x\\l + ti\\y\\i. (A3) 

y 2 

Because of Assumption IA.21 for any zi,Z 2 G M”, 

f{z2) < f{zi) + Vf{zi)^{z2 - zi) + ^1^2 - ziWl. (A4) 

In particular, by setting zi = x, Z 2 = we get 

‘('(^ISTa) ~ /(^ISTa) T /x||Xjg.pAl|l 

< /(x) + V/(x)^(xfsTA-x) + ^||xfsTA-2:||i+/^lkfsTAlli = r^- (A5) 

Let us denote the point obtained as a consequence of the globalization mechanism, 
which will be the new iterate, as x+. This corresponds to in step 14 of Algorithm 
[2j Realize that the loop in steps 8-13 of Algorithm 2 terminates finitely because once 


16 








May 19, 2015 


Arxiv'OBA'Paper 


a drops to a value below e, it is set to 0 and then + ad) = and the 

sufficiency condition (step 8 of Algorithm [5]) is trivially satished by (lASp . 

By design, our algorithm generates the point snch that 

4>(x~^) < f(x) + V/(x) — x) + ~||Xjgrp^ — x|| + A^IIXjgrp^lll. (-^6) 

Combining this equation with the fact that xj'g^^ is the minimizer in objective (IA3I) . we 
have for any d G M” and y = x -j- -^d that 


<^(x^} < fix) + V/(x)^ 



L 

A , 

2 

A , 


+ — 
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X + —d 
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~ 2 
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2 ^ 
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where the second inequality follows from ()A2I) with y = x + ^d. In particular, we can set 
d to be X* — X and obtain 


4>ix+) <(j)^x + ^(x* “ ^ 11^* “ ^lli- A7) 

Using Assumption |AA] and the convexity of the ^i-norm, we have, for any zi,Z 2 G M”' 
and t G [0,1], 


4>(tzi + (1 - t)z 2 ) < tcpizi) + (1 - t)4>{z2) - -At(l - t)\\zi - Z 2 \\l. 


Setting zi = x, Z 2 = x*, and t = (l — we get 


L 


X + — (x* - x) ) < —4>ix*) + ( 1 - — ) (pix) - 


L 


L 


2L 


1 - 


L 


IX — X 


Combining this result with (IA7p . we get 


A 


4>ix~^) < ^ </>(x) - 


L 


A 


L 


2L 


1 - 


L 


I X — ^ 112 ffi 


A2 A 


2L 


1 - 


L 


= (t>ix*) + ( 1 - - ) ((/>(x) - (/>(x*)). 


— X 


||2 

Il2 


and therefore 


f'ix'^) — (j){x*) < 



(0(x) - 0(x*)). 
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By reintroducing the iteration index and using recursion, we have that 


(/>(x'=) - (t){x*) < ( 1 - ^ ) (0(x^) - 4>{x*)) 


as required. 


Appendix B. Reproducible Research 

The MATLAB code used to generate the “Synthetic” problem is presented below. Given 
a dimension n, we use the following code snippet to generate the vector of labels (denoted 
by y) and the data matrix (denoted by X). 

y=-l+(rand(n,1)>0.5)*2; 

X = rand(n,n); 

X = X + X>; 

mineig = niin(eig(X)); 

if(mineig<0) 

X = eye(size(X))*mineig*-2+X; 
end 

X = chol(X); 
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